Advances in Ngram-based Discrimination of Similar Languages
نویسندگان
چکیده
We describe the systems entered by the National Research Council in the 2016 shared task on discriminating similar languages. Like previous years, we relied on character ngram features, and a combination of discriminative and generative statistical classifiers. We mostly investigated the influence of the amount of data on the performance, in the open task, and compared the twostage approach (predicting language/group, then variant) to a flat approach. Results suggest that ngrams are still state-of-the-art for language and variant identification, that additional data has a small but decisive impact, and that the two-stage approach performs slightly better, everything else being kept equal, than the flat approach.
منابع مشابه
Experiments in Discriminating Similar Languages
We describe the system built by the National Research Council (NRC) Canada for the 2015 shared task on Discriminating between similar languages. The NRC system uses various statistical classifiers trained on character and word ngram features. Predictions rely on a two-stage process: we first predict the language group, then discriminate between languages or variants within the group. This year,...
متن کاملمقایسه روش های طیفی برای شناسایی زبان گفتاری
Identifying spoken language automatically is to identify a language from the speech signal. Language identification systems can be divided into two categories, spectral-based methods and phonetic-based methods. In the former, short-time characteristics of speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...
متن کاملSwordfish: Using Ngrams in an Unsupervised Approach to Morphological Analysis
Morphological analysis refers to the art of separating a word into its base units of meaning, or morphemes. Many popular approaches to this, including Porter’s algorithm, have been rule-based. These rule-based algorithms however, generally only perform stemming, the identification of root morphemes, which is only a part of morphological analysis. Such algorithms can only reasonably be applied t...
متن کاملNon-linear Mapping for Improved Identification of 1300+ Languages
Non-linear mappings of the form P (ngram)γ and log(1+τP (ngram)) log(1+τ) are applied to the n-gram probabilities in five trainable open-source language identifiers. The first mapping reduces classification errors by 4.0% to 83.9% over a test set of more than one million 65-character strings in 1366 languages, and by 2.6% to 76.7% over a subset of 781 languages. The second mapping improves four...
متن کاملPhonotactic Modeling of Extremely Low Resource Languages
This paper presents a novel approach to low resource language modeling. Here we propose a model for word prediction which is based on multi-variant ngram abstraction with weighted confidence level. We demonstrate a significant improvement in word recall over ”traditional” KneserNey back-off model for most of the examined low resource languages.
متن کامل